Getting Started

Wolfram Alpha's website produces a nice summary of the contents of some public domain books, for instance A Tale of Two Cities. In this notebook, we'll try to reproduce some of that functionality using NLTK's built-in tools.

Rather than using the Text object, we will need to use a Corpus object, since we will want to use the sentence structure. NLTK ships several books with sentence boundaries already marked; in particular, the gutenberg collection has a bunch of books in it, so we will use that. (If the collection is not on your machine yet, nltk.download('gutenberg') fetches it.)

The first thing Wolfram Alpha shows is metadata about the book, which we won't try to replicate here (that is more relevant for other ischool courses such as open data or i202).


In [1]:
import nltk
from nltk import corpus

First, let's see the names of the books in the collection.


In [2]:
nltk.corpus.gutenberg.fileids()


Out[2]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

Choose a book

Now, let's pick a book to work with. I've decided on one for this exercise: Whitman's Leaves of Grass.


In [3]:
sents = nltk.corpus.gutenberg.sents(fileids='whitman-leaves.txt')
words = nltk.corpus.gutenberg.words(fileids='whitman-leaves.txt')
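To make the difference between the two views concrete, here is a small sketch that does not need the corpus at all. The name sample_sents and its contents are a made-up stand-in for what gutenberg.sents() returns (a list of sentences, each a list of token strings); flattening it gives the same kind of token list that gutenberg.words() returns.

```python
# Hypothetical stand-in for gutenberg.sents(): a list of sentences,
# where each sentence is a list of token strings.
sample_sents = [
    ["Leaves", "of", "Grass"],
    ["Come", ",", "said", "my", "soul", ","],
]

# Flattening the sentences yields one long token list, like gutenberg.words().
sample_words = [tok for sent in sample_sents for tok in sent]

print(len(sample_sents))   # number of sentences
print(len(sample_words))   # number of tokens, punctuation included
```

Punctuation marks are tokens in their own right in both views, which matters later when counting words.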

The opening phrase

Getting back to the Wolfram Alpha UI, the first thing we see is the metadata, which we will ignore here (that is a project for 202 or the open data class). Next, it shows the opening phrase. Print that out here, skipping the short title and author lines that come before it:


In [24]:
# Skip the short leading "sentences" until one has at least 10 tokens.
i = 0
while len(sents[i]) < 10:
    i += 1

first_line = sents[i]
" ".join(first_line)


Out[24]:
"Come , said my soul , Such verses for my Body let us write , ( for we are one ,) That should I after return , Or , long , long hence , in other spheres , There to some group of mates the chants resuming , ( Tallying Earth ' s soil , trees , winds , tumultuous waves ,) Ever with pleas ' d smile I may keep on , Ever and ever yet the verses owning -- as , first , I here and now Signing for Soul and Body , set to them my name ,"

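Normalizing the text

To prepare for the next steps, you should do some normalization of the text. Although this will cause some errors (contractions and possessives such as "pleas'd" and "Earth's" are left as fragments like 'pleas', 'd' and 's'), you should first: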

  1. remove punctuation
  2. lowercase the text

In [28]:
normal_words = [w.lower() for w in words if w.isalpha()]
normal_words


Out[28]:
['leaves',
 'of',
 'grass',
 'by',
 'walt',
 'whitman',
 'come',
 'said',
 'my',
 'soul',
 'such',
 'verses',
 'for',
 'my',
 'body',
 'let',
 'us',
 'write',
 'for',
 'we',
 'are',
 'one',
 'that',
 'should',
 'i',
 'after',
 'return',
 'or',
 'long',
 'long',
 'hence',
 'in',
 'other',
 'spheres',
 'there',
 'to',
 'some',
 'group',
 'of',
 'mates',
 'the',
 'chants',
 'resuming',
 'tallying',
 'earth',
 's',
 'soil',
 'trees',
 'winds',
 'tumultuous',
 'waves',
 'ever',
 'with',
 'pleas',
 'd',
 'smile',
 'i',
 'may',
 'keep',
 'on',
 'ever',
 'and',
 'ever',
 'yet',
 'the',
 'verses',
 'owning',
 'as',
 'first',
 'i',
 'here',
 'and',
 'now',
 'signing',
 'for',
 'soul',
 'and',
 'body',
 'set',
 'to',
 'them',
 'my',
 'name',
 'walt',
 'whitman',
 'book',
 'i',
 'inscriptions',
 'one',
 's',
 'self',
 'i',
 'sing',
 'one',
 's',
 'self',
 'i',
 'sing',
 'a',
 'simple',
 'separate',
 'person',
 'yet',
 'utter',
 'the',
 'word',
 'democratic',
 'the',
 'word',
 'en',
 'masse',
 'of',
 'physiology',
 'from',
 'top',
 'to',
 'toe',
 'i',
 'sing',
 'not',
 'physiognomy',
 'alone',
 'nor',
 'brain',
 'alone',
 'is',
 'worthy',
 'for',
 'the',
 'muse',
 'i',
 'say',
 'the',
 'form',
 'complete',
 'is',
 'worthier',
 'far',
 'the',
 'female',
 'equally',
 'with',
 'the',
 'male',
 'i',
 'sing',
 'of',
 'life',
 'immense',
 'in',
 'passion',
 'pulse',
 'and',
 'power',
 'cheerful',
 'for',
 'freest',
 'action',
 'form',
 'd',
 'under',
 'the',
 'laws',
 'divine',
 'the',
 'modern',
 'man',
 'i',
 'sing',
 'as',
 'i',
 'ponder',
 'd',
 'in',
 'silence',
 'as',
 'i',
 'ponder',
 'd',
 'in',
 'silence',
 'returning',
 'upon',
 'my',
 'poems',
 'considering',
 'lingering',
 'long',
 'a',
 'phantom',
 'arose',
 'before',
 'me',
 'with',
 'distrustful',
 'aspect',
 'terrible',
 'in',
 'beauty',
 'age',
 'and',
 'power',
 'the',
 'genius',
 'of',
 'poets',
 'of',
 'old',
 'lands',
 'as',
 'to',
 'me',
 'directing',
 'like',
 'flame',
 'its',
 'eyes',
 'with',
 'finger',
 'pointing',
 'to',
 'many',
 'immortal',
 'songs',
 'and',
 'menacing',
 'voice',
 'what',
 'singest',
 'thou',
 'it',
 'said',
 'know',
 'st',
 'thou',
 'not',
 'there',
 'is',
 'hut',
 'one',
 'theme',
 'for',
 'ever',
 'enduring',
 'bards',
 'and',
 'that',
 'is',
 'the',
 'theme',
 'of',
 'war',
 'the',
 'fortune',
 'of',
 'battles',
 'the',
 'making',
 'of',
 'perfect',
 'soldiers',
 'be',
 'it',
 'so',
 'then',
 'i',
 'answer',
 'd',
 'i',
 'too',
 'haughty',
 'shade',
 'also',
 'sing',
 'war',
 'and',
 'a',
 'longer',
 'and',
 'greater',
 'one',
 'than',
 'any',
 'waged',
 'in',
 'my',
 'book',
 'with',
 'varying',
 'fortune',
 'with',
 'flight',
 'advance',
 'and',
 'retreat',
 'victory',
 'deferr',
 'd',
 'and',
 'wavering',
 'yet',
 'methinks',
 'certain',
 'or',
 'as',
 'good',
 'as',
 'certain',
 'at',
 'the',
 'last',
 'the',
 'field',
 'the',
 'world',
 'for',
 'life',
 'and',
 'death',
 'for',
 'the',
 'body',
 'and',
 'for',
 'the',
 'eternal',
 'soul',
 'lo',
 'i',
 'too',
 'am',
 'come',
 'chanting',
 'the',
 'chant',
 'of',
 'battles',
 'i',
 'above',
 'all',
 'promote',
 'brave',
 'soldiers',
 'in',
 'cabin',
 'd',
 'ships',
 'at',
 'sea',
 'in',
 'cabin',
 'd',
 'ships',
 'at',
 'sea',
 'the',
 'boundless',
 'blue',
 'on',
 'every',
 'side',
 'expanding',
 'with',
 'whistling',
 'winds',
 'and',
 'music',
 'of',
 'the',
 'waves',
 'the',
 'large',
 'imperious',
 'waves',
 'or',
 'some',
 'lone',
 'bark',
 'buoy',
 'd',
 'on',
 'the',
 'dense',
 'marine',
 'where',
 'joyous',
 'full',
 'of',
 'faith',
 'spreading',
 'white',
 'sails',
 'she',
 'cleaves',
 'the',
 'ether',
 'mid',
 'the',
 'sparkle',
 'and',
 'the',
 'foam',
 'of',
 'day',
 'or',
 'under',
 'many',
 'a',
 'star',
 'at',
 'night',
 'by',
 'sailors',
 'young',
 'and',
 'old',
 'haply',
 'will',
 'i',
 'a',
 'reminiscence',
 'of',
 'the',
 'land',
 'be',
 'read',
 'in',
 'full',
 'rapport',
 'at',
 'last',
 'here',
 'are',
 'our',
 'thoughts',
 'voyagers',
 'thoughts',
 'here',
 'not',
 'the',
 'land',
 'firm',
 'land',
 'alone',
 'appears',
 'may',
 'then',
 'by',
 'them',
 'be',
 'said',
 'the',
 'sky',
 'o',
 'erarches',
 'here',
 'we',
 'feel',
 'the',
 'undulating',
 'deck',
 'beneath',
 'our',
 'feet',
 'we',
 'feel',
 'the',
 'long',
 'pulsation',
 'ebb',
 'and',
 'flow',
 'of',
 'endless',
 'motion',
 'the',
 'tones',
 'of',
 'unseen',
 'mystery',
 'the',
 'vague',
 'and',
 'vast',
 'suggestions',
 'of',
 'the',
 'briny',
 'world',
 'the',
 'liquid',
 'flowing',
 'syllables',
 'the',
 'perfume',
 'the',
 'faint',
 'creaking',
 'of',
 'the',
 'cordage',
 'the',
 'melancholy',
 'rhythm',
 'the',
 'boundless',
 'vista',
 'and',
 'the',
 'horizon',
 'far',
 'and',
 'dim',
 'are',
 'all',
 'here',
 'and',
 'this',
 'is',
 'ocean',
 's',
 'poem',
 'then',
 'falter',
 'not',
 'o',
 'book',
 'fulfil',
 'your',
 'destiny',
 'you',
 'not',
 'a',
 'reminiscence',
 'of',
 'the',
 'land',
 'alone',
 'you',
 'too',
 'as',
 'a',
 'lone',
 'bark',
 'cleaving',
 'the',
 'ether',
 'purpos',
 'd',
 'i',
 'know',
 'not',
 'whither',
 'yet',
 'ever',
 'full',
 'of',
 'faith',
 'consort',
 'to',
 'every',
 'ship',
 'that',
 'sails',
 'sail',
 'you',
 'bear',
 'forth',
 'to',
 'them',
 'folded',
 'my',
 'love',
 'dear',
 'mariners',
 'for',
 'you',
 'i',
 'fold',
 'it',
 'here',
 'in',
 'every',
 'leaf',
 'speed',
 'on',
 'my',
 'book',
 'spread',
 'your',
 'white',
 'sails',
 'my',
 'little',
 'bark',
 'athwart',
 'the',
 'imperious',
 'waves',
 'chant',
 'on',
 'sail',
 'on',
 'bear',
 'o',
 'er',
 'the',
 'boundless',
 'blue',
 'from',
 'me',
 'to',
 'every',
 'sea',
 'this',
 'song',
 'for',
 'mariners',
 'and',
 'all',
 'their',
 'ships',
 'to',
 'foreign',
 'lands',
 'i',
 'heard',
 'that',
 'you',
 'ask',
 'd',
 'for',
 'something',
 'to',
 'prove',
 'this',
 'puzzle',
 'the',
 'new',
 'world',
 'and',
 'to',
 'define',
 'america',
 'her',
 'athletic',
 'democracy',
 'therefore',
 'i',
 'send',
 'you',
 'my',
 'poems',
 'that',
 'you',
 'behold',
 'in',
 'them',
 'what',
 'you',
 'wanted',
 'to',
 'a',
 'historian',
 'you',
 'who',
 'celebrate',
 'bygones',
 'who',
 'have',
 'explored',
 'the',
 'outward',
 'the',
 'surfaces',
 'of',
 'the',
 'races',
 'the',
 'life',
 'that',
 'has',
 'exhibited',
 'itself',
 'who',
 'have',
 'treated',
 'of',
 'man',
 'as',
 'the',
 'creature',
 'of',
 'politics',
 'aggregates',
 'rulers',
 'and',
 'priests',
 'i',
 'habitan',
 'of',
 'the',
 'alleghanies',
 'treating',
 'of',
 'him',
 'as',
 'he',
 'is',
 'in',
 'himself',
 'in',
 'his',
 'own',
 'rights',
 'pressing',
 'the',
 'pulse',
 'of',
 'the',
 'life',
 'that',
 'has',
 'seldom',
 'exhibited',
 'itself',
 'the',
 'great',
 'pride',
 'of',
 'man',
 'in',
 'himself',
 'chanter',
 'of',
 'personality',
 'outlining',
 'what',
 'is',
 'yet',
 'to',
 'be',
 'i',
 'project',
 'the',
 'history',
 'of',
 'the',
 'future',
 'to',
 'thee',
 'old',
 'cause',
 'to',
 'thee',
 'old',
 'cause',
 'thou',
 'peerless',
 'passionate',
 'good',
 'cause',
 'thou',
 'stern',
 'remorseless',
 'sweet',
 'idea',
 'deathless',
 'throughout',
 'the',
 'ages',
 'races',
 'lands',
 'after',
 'a',
 'strange',
 'sad',
 'war',
 'great',
 'war',
 'for',
 'thee',
 'i',
 'think',
 'all',
 'war',
 'through',
 'time',
 'was',
 'really',
 'fought',
 'and',
 'ever',
 'will',
 'be',
 'really',
 'fought',
 'for',
 'thee',
 'these',
 'chants',
 'for',
 'thee',
 'the',
 'eternal',
 'march',
 'of',
 'thee',
 'a',
 'war',
 'o',
 'soldiers',
 'not',
 'for',
 'itself',
 'alone',
 'far',
 'far',
 'more',
 'stood',
 'silently',
 'waiting',
 'behind',
 'now',
 'to',
 'advance',
 'in',
 'this',
 'book',
 'thou',
 'orb',
 'of',
 'many',
 'orbs',
 'thou',
 'seething',
 'principle',
 'thou',
 'well',
 'kept',
 'latent',
 'germ',
 'thou',
 'centre',
 'around',
 'the',
 'idea',
 'of',
 'thee',
 'the',
 'war',
 'revolving',
 'with',
 'all',
 'its',
 'angry',
 'and',
 'vehement',
 'play',
 'of',
 'causes',
 'with',
 'vast',
 'results',
 'to',
 'come',
 'for',
 'thrice',
 'a',
 'thousand',
 'years',
 'these',
 'recitatives',
 'for',
 'thee',
 'my',
 'book',
 'and',
 'the',
 'war',
 'are',
 'one',
 'merged',
 'in',
 'its',
 'spirit',
 'i',
 'and',
 'mine',
 'as',
 'the',
 'contest',
 'hinged',
 'on',
 'thee',
 'as',
 'a',
 'wheel',
 'on',
 'its',
 'axis',
 'turns',
 'this',
 'book',
 'unwitting',
 'to',
 'itself',
 'around',
 'the',
 'idea',
 'of',
 'thee',
 'eidolons',
 'i',
 'met',
 'a',
 'seer',
 'passing',
 'the',
 'hues',
 'and',
 'objects',
 'of',
 'the',
 'world',
 'the',
 'fields',
 'of',
 'art',
 'and',
 'learning',
 'pleasure',
 'sense',
 'to',
 'glean',
 'eidolons',
 'put',
 'in',
 'thy',
 'chants',
 'said',
 'he',
 'no',
 'more',
 'the',
 'puzzling',
 'hour',
 'nor',
 'day',
 'nor',
 'segments',
 'parts',
 'put',
 'in',
 'put',
 'first',
 'before',
 'the',
 'rest',
 'as',
 'light',
 'for',
 'all',
 'and',
 'entrance',
 'song',
 'of',
 'all',
 'that',
 'of',
 'eidolons',
 'ever',
 'the',
 'dim',
 'beginning',
 'ever',
 'the',
 'growth',
 'the',
 'rounding',
 'of',
 'the',
 'circle',
 'ever',
 'the',
 'summit',
 'and',
 'the',
 'merge',
 'at',
 'last',
 'to',
 'surely',
 'start',
 'again',
 'eidolons',
 'eidolons',
 'ever',
 'the',
 'mutable',
 ...]
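The fragments visible in the output above ('s', 'd', 'st') come from the filter itself. A minimal sketch on a handful of tokens, assuming the same isalpha() and lower() approach; sample_tokens is an illustrative list, not read from the corpus:

```python
# Illustrative tokens, including punctuation, a split contraction and a number.
sample_tokens = ["Tallying", "Earth", "'", "s", "soil", ",",
                 "pleas", "'", "d", "smile", "1855"]

# isalpha() is False for punctuation and for anything containing digits,
# so those tokens are dropped; the rest are lowercased.
normalized = [tok.lower() for tok in sample_tokens if tok.isalpha()]

print(normalized)
# ['tallying', 'earth', 's', 'soil', 'pleas', 'd', 'smile']
```

Because the tokenizer has already split "Earth's" into three tokens, removing the apostrophe leaves a stray 's' behind. That is the kind of error the normalization step accepts.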

Creating a table

Python provides an easy way to line columns up in a table. A format specifier such as %6s produces a string padded to width 6. It is right-justified by default, but a minus sign after the % switches it to left-justified, so %-3d means left-justify an integer in a field of width 3. And if you don't know the width in advance, you can make it a variable by using an asterisk in place of the number, as in '%*s' or '%-*d'; the width is then passed as an extra argument just before the value. Check out this example:


In [29]:
# A single parenthesized argument prints the same way under Python 2 and 3.
print('%-16s %-16s' % ('Info type', 'Value'))
print('%-16s %-16d' % ('number of words', 100000))


Info type        Value           
number of words  100000          
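The asterisk form can be sketched as follows. The rows list and the padding of two extra spaces are arbitrary choices for illustration; the column width is computed from the longest label rather than hard-coded.

```python
# Illustrative rows; the labels and values are made up for the example.
rows = [("Info type", "Value"),
        ("number of words", 100000),
        ("longest word", "example")]

# Width of the widest label plus two spaces of padding.
width = max(len(label) for label, _ in rows) + 2

lines = []
for label, value in rows:
    # Each '*' consumes one extra argument (the width) before the value itself.
    lines.append("%-*s%-*s" % (width, label, width, value))

for line in lines:
    print(line)

# Without the minus sign the value is right-justified instead.
right = "%*d" % (6, 42)
print(right)
```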

Word Properties Table

Next there is a table of word properties, which you should compute (skip unique word stems, since we will learn how to do that in the coming weeks). Make a table that prints out:

  1. number of words
  2. number of unique words
  3. average word length
  4. longest word

You can make your table look prettier than the example I showed above if you like!


In [36]:
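# Mean word length is computed directly, since `average` is only defined in pylab mode.
mean_len = sum(len(w) for w in words) / float(len(words))
print("{:<16}{:<16}{:<16}{:<16}"
      .format(len(words), len(set(words)), mean_len, "longest"))  # "longest" is a placeholder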


154883          14329           3.69624813569   longest         
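One way to fill in all four rows, shown here as a sketch on a tiny made-up word list rather than on the book. The name sample_words and the layout with one property per line are assumptions; the same expressions would apply to normal_words.

```python
# Illustrative, already-normalized word list.
sample_words = ["come", "said", "my", "soul", "such",
                "verses", "for", "my", "body"]

num_words = len(sample_words)
num_unique = len(set(sample_words))
# float() keeps the division exact under Python 2 as well as Python 3.
avg_len = sum(len(w) for w in sample_words) / float(num_words)
# max() with key=len returns the first word of maximal length.
longest = max(sample_words, key=len)

rows = [("number of words", num_words),
        ("number of unique words", num_unique),
        ("average word length", "%.2f" % avg_len),
        ("longest word", longest)]

for label, value in rows:
    print("%-24s%s" % (label, value))
```

Putting one property on each line keeps the labels next to their values, which a single row of four numbers does not.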

Most Frequent Words List

Next is the most frequent words list. Wolfram Alpha's table shows each of the most frequent words together with its share of the total word count as a percentage, so compute that percentage as well. Their formatting is nice, so try to replicate that too (although I don't think you can use alternative fonts in ipython notebook output; if someone knows otherwise, I'd love to hear about it).
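A possible starting point, using collections.Counter from the standard library on a small made-up list. The names sample_words and top are illustrative, and this is only one of several ways to do the count (NLTK's FreqDist would be another).

```python
from collections import Counter

# Illustrative, already-normalized word list.
sample_words = ["the", "word", "democratic", "the", "word",
                "en", "masse", "the"]

counts = Counter(sample_words)
total = len(sample_words)

# (word, count, percent of all words) for the two most frequent words.
top = [(word, count, 100.0 * count / total)
       for word, count in counts.most_common(2)]

for word, count, percent in top:
    print("%-12s%6d%8.1f%%" % (word, count, percent))
```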


In [ ]:

Most Frequent Capitalized Words List

Oops! We normalized the text! Well, this is a tough one, because you really don't want to count words that are capitalized only because they appear at the start of a sentence. We should leave this one for later, after we've learned some other techniques.

Most Frequent Two Word Phrases

Ah, there is a nice little trick for doing this one. Can you think of some ways to do it (other than using the bigram tool that is built into the Text object)?
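One candidate for such a trick, offered as a guess rather than as the intended answer, is to zip the word list with a copy of itself shifted by one position. The sample list below is illustrative.

```python
from collections import Counter

# Illustrative, already-normalized word list.
sample_words = ["one", "s", "self", "i", "sing",
                "one", "s", "self", "i", "sing"]

# zip() pairs each word with its successor and stops at the shorter input,
# so the last word never starts a pair.
pair_counts = Counter(zip(sample_words, sample_words[1:]))

for (first, second), count in pair_counts.most_common(3):
    print("%s %s: %d" % (first, second, count))
```

On a flat word list this also counts pairs that straddle a sentence boundary. Building the pairs one sentence at a time would avoid that.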


In [ ]:

Sentence Properties

This is a piece of cake.
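A sketch of what this might look like, assuming the properties of interest are the number of sentences and the average sentence length in tokens. The sample_sents list is a made-up stand-in for sents.

```python
# Illustrative stand-in for the corpus sentences.
sample_sents = [
    ["I", "sing", "."],
    ["Come", ",", "said", "my", "soul", "."],
    ["Be", "it", "so", "."],
]

num_sents = len(sample_sents)
# Token counts include punctuation; filter with isalpha() first to exclude it.
avg_sent_len = sum(len(s) for s in sample_sents) / float(num_sents)
longest_sent = max(sample_sents, key=len)

print("%-28s%d" % ("number of sentences", num_sents))
print("%-28s%.2f" % ("average sentence length", avg_sent_len))
print("%-28s%s" % ("longest sentence", " ".join(longest_sent)))
```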


In [ ]: